@iverase iverase commented Aug 8, 2025

This commit adds a new parameter to the k-means result that contains the current count of vectors in each cluster. The array is always up to date, so at any time it contains the number of vectors assigned to each cluster. It is used wherever we count the number of assigned vectors, both in the codec and in the algorithm itself. More importantly, it will allow us to limit the number of vectors in a cluster if we wish to, in order to build more balanced clusters.
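To illustrate the idea, here is a minimal sketch of keeping such a count array in sync during reassignment. The names (`ClusterCounts`, `assign`, `maxPerCluster`) are hypothetical and are not taken from the actual change; the size cap is only the possible future use mentioned above.

```java
import java.util.Arrays;

// Hypothetical sketch, not the Elasticsearch implementation.
final class ClusterCounts {
    private final int[] assignments; // assignments[v] = cluster of vector v, or -1 if unassigned
    private final int[] counts;      // counts[c] = number of vectors currently in cluster c

    ClusterCounts(int numVectors, int numClusters) {
        assignments = new int[numVectors];
        Arrays.fill(assignments, -1);
        counts = new int[numClusters];
    }

    /**
     * Moves vector v to cluster c and updates both counts in the same step,
     * so the array is never stale. Returns false if c already holds
     * maxPerCluster vectors (assumed cap, for balancing).
     */
    boolean assign(int v, int c, int maxPerCluster) {
        int prev = assignments[v];
        if (prev == c) {
            return true; // nothing to do
        }
        if (counts[c] >= maxPerCluster) {
            return false; // target cluster is full
        }
        if (prev >= 0) {
            counts[prev]--;
        }
        assignments[v] = c;
        counts[c]++;
        return true;
    }

    int count(int c) {
        return counts[c];
    }
}
```

Because every move goes through `assign`, any caller can read `count(c)` without rescanning the assignments.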

I did not notice any performance regression or change in recall after this change. The only difference from the previous version is that when we update the centroids after an assignment step, we now use all the assigned vectors, whereas before we used only the sampled vectors.
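As a rough illustration of that centroid update, the sketch below recomputes each centroid as the mean of every vector assigned to it. The class and method names are hypothetical, and keeping the previous centroid for an empty cluster is an assumption of this sketch, not something stated in the PR.

```java
// Hypothetical sketch, not the Elasticsearch implementation.
final class CentroidUpdate {
    /**
     * Recomputes centroids as the mean of all assigned vectors.
     * assignments[v] is the cluster index of vectors[v].
     * Empty clusters keep their previous centroid (assumption).
     */
    static float[][] recompute(float[][] vectors, int[] assignments, float[][] previous) {
        int k = previous.length;
        int dim = previous[0].length;
        float[][] centroids = new float[k][dim];
        int[] counts = new int[k];
        // Accumulate per-cluster sums and counts over every assigned vector.
        for (int v = 0; v < vectors.length; v++) {
            int c = assignments[v];
            counts[c]++;
            for (int d = 0; d < dim; d++) {
                centroids[c][d] += vectors[v][d];
            }
        }
        // Divide sums by counts to get the means.
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                centroids[c] = previous[c].clone();
                continue;
            }
            for (int d = 0; d < dim; d++) {
                centroids[c][d] /= counts[c];
            }
        }
        return centroids;
    }
}
```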

@iverase iverase changed the title Add the current count of vector in a cluster in hierarchical k-means Add the current count of vectors in a cluster in hierarchical k-means Aug 8, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Aug 8, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Member

@benwtrent benwtrent left a comment

I am not 100% sure of the reason for this change. Is this to combat scenarios with many duplicate vectors? If so, I think we might just want to encode doc-ids per vector block.

I am not sure splitting up per centroid buys us much, since in situations where many vectors are the same we would still need to search MORE centroids to ensure adequate recall (unless filtering was applied). And if filtering was applied, we know the postings list is sorted by doc id, so we can skip blocks without decoding...
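A minimal sketch of that block-skipping idea, assuming each block records its minimum and maximum doc id and the filter is available as a sorted array of accepted doc ids. This layout and the names below are hypothetical, not the actual codec format.

```java
import java.util.Arrays;

// Hypothetical sketch, not the Elasticsearch codec.
final class BlockSkipping {
    /**
     * Returns the indices of blocks that may contain an accepted doc id.
     * Blocks whose [min, max] doc-id range holds no accepted doc are skipped
     * without decoding.
     */
    static int[] blocksToDecode(int[] blockMinDocId, int[] blockMaxDocId, int[] acceptedDocsSorted) {
        int[] out = new int[blockMinDocId.length];
        int n = 0;
        for (int b = 0; b < blockMinDocId.length; b++) {
            // Find the first accepted doc id >= the block's minimum.
            int idx = Arrays.binarySearch(acceptedDocsSorted, blockMinDocId[b]);
            if (idx < 0) {
                idx = -idx - 1; // convert to insertion point
            }
            if (idx < acceptedDocsSorted.length && acceptedDocsSorted[idx] <= blockMaxDocId[b]) {
                out[n++] = b; // an accepted doc falls inside this block's range
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

For example, with block ranges [0, 9], [10, 19], [20, 29] and accepted doc ids 3 and 25, only the first and last blocks need decoding.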

@iverase
Contributor Author

iverase commented Aug 25, 2025

I agree with you, @benwtrent; I'm not sure this change is necessary.

Labels
>non-issue, :Search Relevance/Vectors, Team:Search Relevance, v9.2.0